equivalent model
- North America > Canada > Ontario > Toronto (0.29)
- Asia > Singapore (0.14)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- Research Report > New Finding (0.46)
- Research Report > Experimental Study (0.46)
- Banking & Finance (0.46)
- Information Technology > Security & Privacy (0.46)
- Government (0.46)
WISCA: A Lightweight Model Transition Method to Improve LLM Training via Weight Scaling
Li, Jiacheng, Tan, Jianchao, Yang, Zhidong, Sun, Pingwei, Huo, Feiye, Qin, Jiayu, Sun, Yerui, Xie, Yuchen, Cai, Xunliang, Zhang, Xiangyu, He, Maoxin, Tan, Guangming, Jia, Weile, Zhao, Tong
The Transformer architecture has come to dominate the LLM field. Recent advances in training optimization for Transformer-based large language models (LLMs) primarily focus on architectural modifications or optimizer adjustments; however, these approaches lack systematic optimization of weight patterns during training. A weight pattern refers to the distribution and relative magnitudes of the weight parameters in a neural network. To address this issue, we propose a Weight Scaling method called WISCA that enhances training efficiency and model quality by strategically improving neural network weight patterns without changing the network structure. By rescaling weights while preserving model outputs, WISCA indirectly optimizes the model's training trajectory. Experiments demonstrate that WISCA significantly improves convergence quality (measured by generalization capability and loss reduction), particularly in LLMs with Grouped Query Attention (GQA) architectures and in LoRA fine-tuning tasks. Empirical results show a 5.6% average improvement on zero-shot validation tasks and a 2.12% average reduction in training perplexity across multiple architectures.
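The core operation the WISCA abstract describes, rescaling weights while preserving model outputs, can be illustrated with a toy two-layer linear block. This is a hedged sketch of the general idea, not the authors' actual WISCA algorithm: scaling the first weight matrix by a factor alpha and the second by 1/alpha changes the weight pattern while leaving the composed output unchanged.

```python
import numpy as np

# Toy illustration of output-preserving weight rescaling (not the authors'
# exact method): in a linear block y = W2 @ (W1 @ x), scaling W1 by alpha
# and W2 by 1/alpha alters the weight pattern but not the output.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(3, 8))
x = rng.normal(size=4)

alpha = 2.5
W1_scaled = alpha * W1
W2_scaled = W2 / alpha

y_before = W2 @ (W1 @ x)
y_after = W2_scaled @ (W1_scaled @ x)
print(np.allclose(y_before, y_after))  # True
```

In a real Transformer the nonlinearities and normalization layers constrain which rescalings are output-preserving, which is presumably where the method's specifics lie.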
Explanation sensitivity to the randomness of large language models: the case of journalistic text classification
Bogaert, Jeremie, de Marneffe, Marie-Catherine, Descampe, Antonin, Escouflaire, Louis, Fairon, Cedrick, Standaert, Francois-Xavier
Large language models (LLMs) perform very well on several natural language processing tasks but raise explainability challenges. In this paper, we examine the effect of random elements in the training of LLMs on the explainability of their predictions. We do so for a task of opinionated journalistic text classification in French. Using a fine-tuned CamemBERT model and an explanation method based on relevance propagation, we find that training with different random seeds produces models with similar accuracy but variable explanations. We therefore claim that characterizing the statistical distribution of the explanations is needed for the explainability of LLMs. We then explore a simpler model based on textual features, which offers stable explanations but is less accurate. Hence, this simpler model corresponds to a different tradeoff between accuracy and explainability. We show that it can be improved by inserting features derived from CamemBERT's explanations. We finally discuss new research directions suggested by our results, in particular regarding the origin of the observed sensitivity to training randomness.
- Europe > Belgium > Wallonia > Walloon Brabant > Louvain-la-Neuve (0.04)
- Europe > France (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.68)
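The central observation above, that reruns with different seeds agree on the prediction but not on the explanation, can be quantified by looking at the seed-to-seed spread of attribution scores. The sketch below uses made-up data and assumed names (`attributions[s, t]` as the relevance of token t under seed s); it is not the paper's relevance-propagation pipeline.

```python
import numpy as np

# Hypothetical sketch: each row is a token-attribution vector from a model
# trained with a different random seed. The seed-averaged explanation and the
# per-token standard deviation summarize the explanations' distribution.
rng = np.random.default_rng(1)
n_seeds, n_tokens = 10, 6
shared_signal = rng.normal(size=n_tokens)                # part the seeds agree on
seed_noise = rng.normal(scale=0.8, size=(n_seeds, n_tokens))
attributions = shared_signal + seed_noise

mean_explanation = attributions.mean(axis=0)             # center of the distribution
per_token_std = attributions.std(axis=0)                 # seed-to-seed variability
print("mean explanation:", np.round(mean_explanation, 2))
print("per-token std:   ", np.round(per_token_std, 2))
```

Reporting the mean together with the spread, rather than a single seed's attribution vector, is one way to act on the paper's recommendation to characterize the explanations' statistical distribution.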
A Question on the Explainability of Large Language Models and the Word-Level Univariate First-Order Plausibility Assumption
Bogaert, Jeremie, Standaert, Francois-Xavier
The explanations of large language models have recently been shown to be sensitive to the randomness used for their training, creating a need to characterize this sensitivity. In this paper, we propose a characterization that questions the possibility of providing simple and informative explanations for such models. To this end, we give statistical definitions for the explanations' signal, noise and signal-to-noise ratio. We highlight that, in a typical case study where word-level univariate explanations are analyzed with first-order statistical tools, the explanations of simple feature-based models carry more signal and less noise than those of transformer-based ones. We then discuss the possibility of improving these results with alternative definitions of signal and noise that would capture more complex explanations and analysis methods, while also questioning the tradeoff with their plausibility for readers.
- North America > United States > Maryland > Baltimore (0.04)
- Europe > Belgium (0.04)
- Research Report > Experimental Study (0.68)
- Research Report > New Finding (0.47)
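One minimal numerical reading of the signal/noise/SNR framing in this abstract (the symbols and data below are assumptions for illustration, not the paper's definitions): treat the explanation of input i under seed s as e[s, i], take the signal to be the spread of the seed-averaged explanations across inputs, and the noise to be the seed-to-seed spread averaged over inputs.

```python
import numpy as np

# Sketch of a first-order signal-to-noise computation over explanations.
# e[s, i] = (assumed) scalar explanation score for input i under training seed s.
rng = np.random.default_rng(2)
n_seeds, n_inputs = 20, 50
input_effect = rng.normal(size=n_inputs)          # varies across inputs
e = input_effect + rng.normal(scale=0.5, size=(n_seeds, n_inputs))

signal = np.var(e.mean(axis=0))                   # spread of the mean explanation
noise = np.mean(np.var(e, axis=0))                # average seed-to-seed spread
snr = signal / noise
print(f"signal={signal:.3f} noise={noise:.3f} SNR={snr:.3f}")
```

Under this reading, the paper's finding that feature-based models explain with "more signal and less noise" corresponds to a larger SNR for those models than for transformer-based ones.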
Distributional Model Equivalence for Risk-Sensitive Reinforcement Learning
Kastner, Tyler, Erdogdu, Murat A., Farahmand, Amir-massoud
We consider the problem of learning models for risk-sensitive reinforcement learning. We theoretically demonstrate that proper value equivalence, a method of learning models which can be used to plan optimally in the risk-neutral setting, is not sufficient to plan optimally in the risk-sensitive setting. We leverage distributional reinforcement learning to introduce two new notions of model equivalence: one which is general and can be used to plan for any risk measure, but is intractable; and a practical variation which allows one to choose which risk measures one may plan optimally for. We demonstrate how our framework can be used to augment any model-free risk-sensitive algorithm, and provide both tabular and large-scale experiments to demonstrate its effectiveness.
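For readers unfamiliar with planning for a risk measure rather than the mean return: Conditional Value-at-Risk (CVaR) is one standard risk measure of the kind such a framework could target (the abstract does not name it specifically; this is an illustrative assumption). From samples of a return distribution it can be estimated as the mean of the worst (1 − level) fraction of outcomes.

```python
import numpy as np

# Illustrative risk measure on a sampled return distribution (not the paper's
# algorithm): CVaR at a given level is the mean of the worst-case tail.
def cvar(returns, level=0.9):
    """Mean of the worst (1 - level) fraction of sampled returns."""
    sorted_r = np.sort(returns)
    k = max(1, int(np.ceil((1 - level) * len(sorted_r))))
    return sorted_r[:k].mean()

rng = np.random.default_rng(3)
samples = rng.normal(loc=1.0, scale=2.0, size=10_000)
print("mean return:", samples.mean())
print("CVaR(0.9):  ", cvar(samples, 0.9))
```

A risk-neutral planner optimizes the first number; a risk-sensitive planner optimizes something like the second, which is why a model adequate for the mean can be inadequate for the tail.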
Characterization and Greedy Learning of Gaussian Structural Causal Models under Unknown Interventions
Gamella, Juan L., Taeb, Armeen, Heinze-Deml, Christina, Bühlmann, Peter
We consider the problem of recovering the causal structure underlying observations from different experimental conditions when the targets of the interventions in each experiment are unknown. We assume a linear structural causal model with additive Gaussian noise and consider interventions that perturb their targets while maintaining the causal relationships in the system. Different models may entail the same distributions, offering competing causal explanations for the given observations. We fully characterize this equivalence class and offer identifiability results, which we use to derive a greedy algorithm called GnIES to recover the equivalence class of the data-generating model without knowledge of the intervention targets. In addition, we develop a novel procedure to generate semi-synthetic data sets with known causal ground truth but distributions closely resembling those of a real data set of choice. We leverage this procedure and evaluate the performance of GnIES on synthetic, real, and semi-synthetic data sets. Despite the strong Gaussian distributional assumption, GnIES is robust to an array of model violations and competitive in recovering the causal structure in small- to large-sample settings. We provide, in the Python packages "gnies" and "sempler", implementations of GnIES and our semi-synthetic data generation procedure.
- Europe > Switzerland > Zürich > Zürich (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.93)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.87)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Diagnosis (0.62)
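The model class assumed in this abstract, a linear structural causal model with additive Gaussian noise under noise interventions that preserve the causal weights, can be sketched directly. This is a hand-rolled illustration of the setting, not the GnIES algorithm or the `gnies`/`sempler` package APIs; the graph, weights, and intervention target below are made up.

```python
import numpy as np

# A 3-variable linear SCM with additive Gaussian noise: B[i, j] is the weight
# of edge Xi -> Xj. An intervention perturbs one variable's noise scale while
# keeping the causal weights intact, as described in the abstract.
rng = np.random.default_rng(4)
B = np.array([[0.0, 0.8, 0.0],    # X0 -> X1
              [0.0, 0.0, 1.5],    # X1 -> X2
              [0.0, 0.0, 0.0]])

def sample(n, noise_scale=(1.0, 1.0, 1.0)):
    """Ancestral sampling; variables are already in topological order."""
    X = np.zeros((n, 3))
    for j in range(3):
        X[:, j] = X @ B[:, j] + rng.normal(scale=noise_scale[j], size=n)
    return X

obs = sample(5000)                              # observational environment
intv = sample(5000, noise_scale=(1.0, 3.0, 1.0))  # noise intervention on X1
print("var(X1) obs vs intervened:", obs[:, 1].var(), intv[:, 1].var())
```

An algorithm in this setting sees data sets like `obs` and `intv` without being told that X1 was the intervention target, and must recover the equivalence class of graphs consistent with both.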
Machine Learning the Phenomenology of COVID-19 From Early Infection Dynamics
We present a robust data-driven machine learning analysis of the COVID-19 pandemic from its early infection dynamics, specifically infection counts over time. The goal is to extract actionable public health insights. These insights include the infectious force, the rate of a mild infection becoming serious, estimates of asymptomatic infections, and predictions of new infections over time. We focus on USA data starting from the first confirmed infection on January 20, 2020. Our methods reveal significant asymptomatic (hidden) infection, a lag of about 10 days, and we quantitatively confirm that the infectious force is strong, with about a 0.14% transition from mild to serious infection. Our methods are efficient, robust and general, being agnostic to the specific virus and applicable to different populations or cohorts.
- Research Report > Experimental Study (0.46)
- Research Report > New Finding (0.46)
- Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
- Health & Medicine > Therapeutic Area > Immunology (1.00)
- Health & Medicine > Epidemiology (1.00)
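One step of the kind of analysis this abstract describes, estimating the infectious force from early infection counts, can be sketched as a least-squares fit of log-counts against time: in the early phase cumulative counts grow roughly exponentially, so the fitted slope is the per-day growth rate. The counts below are made up for illustration and are not the paper's USA data.

```python
import numpy as np

# Hedged sketch: fit log(counts) ~ a * day + b; the slope a is the early
# exponential growth rate, and log(2)/a is the implied doubling time.
days = np.arange(10)
counts = np.array([2, 3, 5, 8, 13, 21, 34, 55, 90, 145], dtype=float)

slope, intercept = np.polyfit(days, np.log(counts), 1)
growth_rate = slope                       # per-day exponential growth rate
doubling_time = np.log(2) / growth_rate   # days for counts to double
print(f"growth rate ~ {growth_rate:.3f}/day, doubling time ~ {doubling_time:.2f} days")
```

The paper's full method additionally infers hidden (asymptomatic) infections and the mild-to-serious transition rate, which a simple growth fit like this cannot capture.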